ABSTRACT

Cross-validation in relation to choosing the best
tests and selecting the best items in tests is discussed.
Cross-validation demonstrates whether a decision derived from one set
of data is truly effective when this decision is applied to another 
independent, but relevant, sample of people. Cross-validation is 
particularly important after statistical data have been used to 
choose the best tests to make up a battery for use with the next 
group of people. A cross-validation experiment on a new group will 
tell how good the choice of test really is. Item analysis is a means
of improving tests. The data from the analysis are used to eliminate 
doubtful items and to determine the best scoring weights. By applying 
the revised test to a new independent group, the inventory is 
refined. When the test is, without change, administered to an 
entirely new and independent set of criterion groups, cross 
validation data are obtained. (For related document, see TM 002 947.)
(DB) 
CROSS-VALIDATION



PEOPLE keep asking us, "What's this talk about cross-validation?" Perhaps this is a good time to explain
what we think the jargon is all about. In the simplest language we think cross-validation means taking
another independent look, especially verifying a first choice or checking up on a hunch. The idea seems
to us to be hoary with age. At least the notion of taking a second look was well established in horse-and-buggy
days. The driver, you remember, was cautioned at every grade crossing to Stop, Look, and Listen. Fancy
language attaches to such a primitive notion only because the complexities of choosing the best tests for some
purpose and selecting the best items in test construction introduce special difficulties.

The problem of cross-validation is the problem of getting an independent verification, and the special
difficulties we have in our work of personnel testing arise in our need to select the items that look best and the tests
that look most useful from a large number of possibilities. We believe in the experimental approach. We like to
try out tests and items and choose the "best" after seeing the results. Statistics, especially correlation coefficients
and item analysis statistics, get into the play.

We really have two problems. The first is to find the right way to choose the "best" of a number of
possibilities. The second is to find out how good our best choice actually is. Cross-validation is concerned principally
with the second problem.



We often think in this connection of one of our
bright-eyed friends, Angelo, our barber around the
corner. He thinks he has learned quite a bit about
statistics from us and as a matter of fact he has.
Greatly to his sorrow, he cross-validated a selection
system by experimental verification, using horses as
subjects. But that is the way things are in this life.
Empirical studies and particularly cross-validation
have an insidious way of destroying confidence in
systems of evaluation and prediction. Our friend Angelo,
however, is a persistent fellow. He has given up the
ponies, but he is still looking for the practical
gimmick. He is now working on a fascinating plan,
guaranteed by logic, wholly objective and backed up, he
says, by extensive cross-validation statistics.



He starts with a pool of one hundred thousand
equivalent items, each one as comparable to the others
as the pennies in the piggy bank. In fact the items
are Lincoln pennies, and the price is about as
economical as an item can be. Angelo has set up a simple
scheme for administering and scoring these items. He
flips a coin and scores the obverse side (heads, to you)
Republican. Of course the reverse side is classified
Democrat. (Angelo denies that the fact that Lincoln
was a Republican introduces bias in the key. He says
the key is arbitrary and he will change it if anyone
insists.) His theory is that the pennies that can predict
(really post-dict) the 1900 election will be "good"
items. He has already tried out the 100,000 coins, and
approximately 50,000 of them turned out to be scored
Republican. This was for the year that Roosevelt
(Teddy) was elected, and these 50,000 coins clearly
called the election.

The 50,000 discriminating pennies look like good
ones for betting purposes. But Angelo's experience
with horses has convinced him that cross-validation
is in order. He plans to repeat the experiment for the
election year 1904 to identify the coins that were good
enough for the 1900 study but not good enough for
1904. This process is called skimming the cream of
the item pool. Since there is no limit to the extent to
which one can improve tests by careful screening of
the items, Angelo's plan is to extend the process right
down to the election of 1952. He expects to wind up
with half a dozen strictly comparable forms of the
very "best" test for the job of predicting elections, and
he thinks they will be pretty valuable. Extended
sequential experiments will have proved that these
are the pennies that have successfully predicted
fourteen consecutive elections. Angelo is sure that he will
be able to retire in November 1956.

Chance is the gremlin. The larger the number of
predictors (be they tests, items, or pennies) the
more careful we must be to guard against being
fooled by chance results that may "look" meaningful.
Angelo should have realized that the 50,000 pennies
he picked in his experimental try-out for 1900 had no
special virtue except that of the true impartiality with
which chance endows all honest, two-sided pennies.
And in every successive election year when he
"skimmed the cream of his item pool" or "purified" his
scoring system he made the same fundamental error.
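
Angelo's arithmetic can be checked with a short simulation. What follows is only a minimal sketch, assuming fair pennies, one flip per surviving penny per election, and fourteen elections (1900 through 1952). The function names and the random seeds are hypothetical choices made for illustration. It shows the pool shrinking by about half at each skimming, and what a second look at the first batch of "discriminating" pennies would report.

```python
import random

def skim_pennies(n_coins=100_000, n_elections=14, seed=0):
    """Flip each surviving penny once per election and keep those that 'called' it.

    Assumes fair pennies: each one matches the winner with probability 1/2.
    Returns the number of survivors after each election.
    """
    rng = random.Random(seed)
    survivors = n_coins
    history = []
    for _ in range(n_elections):
        survivors = sum(rng.random() < 0.5 for _ in range(survivors))
        history.append(survivors)
    return history

def hit_rate_on_new_election(n_selected, seed=1):
    """Try the selected pennies, unchanged, on one new election; return their hit rate."""
    rng = random.Random(seed)
    hits = sum(rng.random() < 0.5 for _ in range(n_selected))
    return hits / n_selected

history = skim_pennies()
print("survivors after each election:", history)
print("expected survivors after 14 elections:", 100_000 / 2 ** 14)
print("hit rate of the 1900 picks on a new election:",
      hit_rate_on_new_election(history[0]))
```

The expected number of survivors, 100,000 divided by 2 to the 14th power, is about 6, which agrees with the half dozen forms Angelo expects to wind up with. The hit rate of the selected pennies on a fresh election stays near one half, which is all an honest penny can offer.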



We don't know of any dictionary that defines
cross-validation as psychologists use the term. Perhaps we
can clarify the idea with examples.

Suppose we find a test that is reputed to be a good
selector of salesmen. We need such a test in the worst
way and hope this is it. But we know that a man's
success at selling vacuum cleaners door-to-door may
be no good for predicting how well he will do at
selling a line of tools to hardware stores. So we try
out the test in our client's line of selling, being careful
to test an adequate number of applicants and to keep
the test scores put away where they won't influence
anyone until the performance records come in for
checking. This is a validation experiment. An
appropriate statistical analysis will show whether the test
scores correlate with our criteria of success as well as
we expect from the evidence which led us to try the
test. This type of study might well be called mere
validation rather than cross-validation. If we apply
the same test to several similar groups such as samples
of salesmen in similar work and find that the several
validity coefficients are about the same, we can have
more confidence in using the test for selecting this
general class of salesmen. Mosier, in the reference
cited later, uses the expression "validity
generalization" for this kind of result.

Studies with the Differential Aptitude Tests provide
abundant examples. Page 42 of the manual lists 36
coefficients of correlation between Numerical Ability
scores and subsequent course grades in mathematics
in several schools and classes. The coefficients range
from .27 to .65; the median r is .47. In a loose sense
the data contribute to cross-validation, but it is clearer
to think of such correlation studies as examples of
validity generalization. Many such studies in a variety
of situations are required before a generality of
confidence in test validity can arise in the mind of the
careful test user.* The notion of cross-validation ties
in to the general validity question, but it is more
specific to particular applications of tests.

After we have selected some tests for practical use
we usually set some cutoff score for each test or some
combination of scores on several tests which will
maximally eliminate potential failures and maximally
include potential successes. We try to decide on
appropriate cutoff scores or weightings of scores by
studying the data on a sample of candidates. When
we apply these decisions to a new sample of similar
candidates we are ready to cross-validate our findings.
That is, we are ready to take a second look at the
rules we decided on. If the cutoff or weighting system
shows up well we have accomplished a favorable
cross-validation and probably will adopt the system.
However, the results may not be as good as we
expected. In this case the cross-validation study is
negative. The results warn us that more research is
necessary. The essence of the idea is that cross-validation
demonstrates whether a decision derived from one set
of data is truly effective when this decision is applied
to another independent, but relevant, sample of
people.
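
The second look at a cutoff score can be sketched in a few lines. The example is illustrative only: the candidates are simulated, the strength of the relation between score and success (`signal`) is an arbitrary assumption, and the helper names are hypothetical. The cutoff is chosen on the first group alone and is then applied, unchanged, to a second group of different people.

```python
import random

def make_sample(n, rng, signal=0.3):
    """Hypothetical candidates: (test_score, succeeded) pairs with a modest true relation."""
    sample = []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        score = signal * ability + rng.gauss(0, 1)
        succeeded = ability + rng.gauss(0, 1) > 0
        sample.append((score, succeeded))
    return sample

def hit_rate(sample, cutoff):
    """Share of candidates for whom 'accept if score >= cutoff' agrees with the outcome."""
    hits = sum((score >= cutoff) == succeeded for score, succeeded in sample)
    return hits / len(sample)

def best_cutoff(sample):
    """Pick the cutoff that looks best on this sample (the first look)."""
    candidates = sorted(score for score, _ in sample)
    return max(candidates, key=lambda c: hit_rate(sample, c))

rng = random.Random(1)
first_group = make_sample(60, rng)
second_group = make_sample(60, rng)   # new, independent candidates

cutoff = best_cutoff(first_group)
first_rate = hit_rate(first_group, cutoff)
second_rate = hit_rate(second_group, cutoff)   # cutoff applied unchanged
print(f"cutoff={cutoff:.2f}  first group={first_rate:.2f}  second group={second_rate:.2f}")
```

Because the cutoff was tuned to the first group, its hit rate there is flattered by chance. The figure for the second group is the one to rely on, whichever way it comes out.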

*See Test Service Bulletins 07 and 38 for discussion of
validation, especially on the need for many studies.

Cross-validation is particularly important after we
have used statistical data to choose the best tests to
make up a battery for use with the next crop of
applicants. A cross-validation experiment on a new group
will tell us how good our choice really is. The purpose
of cross-validation is to protect us from being fooled
into putting confidence in a relationship which
happens to hold true for the group we started with, but
which will let us down in the long run. And we don't
get this protection unless we make sure:

(a) that the scoring system and combination
of tests picked on the first group is tried
out unchanged on the second;

(b) that the second group is a relevant
sample of different people.



Suppose that we have several hundred items
sampling personality attributes, interests, or attitudes; and
suppose further that we have paired criterion groups,
say, successful and unsuccessful salesmen, or male
and female sophomores, or bank presidents and clerks.
Assume, too, that factors such as age and education
have been controlled. Since we do not know what
responses characterize the groups and especially do
not know what responses identify the significant traits
of the different groups, we administer the items as a
test to the groups and perform a detailed item analysis.
We are simply seeking empirically those items that
are usefully discriminating. The analysis provides a
method of identifying some of the characteristics that
distinguish between the types of person. The data are
used to revise the experimental inventory, to guide
the elimination of poor items (or at least those that
look poor) and perhaps to determine the best scoring
weights (or at least to determine what look like the
best scoring weights).

After we have identified them we could score all the
selected items according to the empirically determined
key and see how well the scores differentiate the
original groups or correlate with the criterion. This
work would doubtless encourage us to press forward
to publication of our apparently important findings,
but being careful workers we realize that
cross-validation is desirable. We may also argue that the criterion
groups are small and we are sure that the validity
might be improved by further refinements. Let's try
it again, we say. So let us suppose that at great cost
and effort, we are successful in obtaining new
independent criterion groups. We now apply the revised
test to these groups.

It is obvious that every test can be improved, and
in the process of analyzing the second sample we note
that some of the items are apparently miskeyed, or
the scoring weights seem wrong. We plan, therefore,
in the statistical work to include an independent
second item analysis. Surely two item analyses are
better than one, and we have only to find the best
method of evaluating the results of the double item
analysis to achieve greater validity. If we have
enough items we may eliminate all the doubtful ones,
namely those in which the second item analysis fails
to confirm fully the original findings. These items, we
would say, are of doubtful validity, and if we have
enough items, we prefer to discard them in favor of
items that have been doubly proved or twice-validated.
So we rescore all the tests on both samples for the
reduced number of items in the second revision and
with the refined key. We obtain scores and compute
appropriate coefficients of correlation for the second
independent sample and the original sample
separately and for the two groups combined.

All this labor produces an impressive pile of data,
and we may think we have cross-validated our test
and our weighting system. As a matter of fact, we
have only refined the inventory. We have as yet no
validation data for the revised instruments.

We do not have cross-validation data until we
administer the tests without change, without further
revision or refinement, to an entirely new and
independent set of criterion groups.

Now there is nothing wrong in trying out items on
a number of samples. The more objective data we
have the better should be our judgments about items
to include. But when we have finally put the items
together and developed a scoring system, we should
undertake a new validation study completely
independent of the samples used in the developmental
phases of the work. A published "validity" coefficient
based on the sample which contributed to the selection
of the items and the making of the key (in the case of
personality and interest inventories) is misleading.
Coefficients so derived should be unambiguously
described. They are not validity coefficients which tell
the practical user what he may expect if he uses the
test or inventory.

In this connection we recall an amusing
experimental example. The experiment is simple and quite
instructive concerning the way fortuitous accidents of
sampling can affect the selection of items for an
inventory. It happened that we had a conference of
school people, ten high school principals and ten
superintendents. To illustrate the point about
cross-validation we offered to build right then and there
what we call the Wardrobe Projective Inventory for
Administrative Personnel. The test is designed to
distinguish between principals and superintendents. The
underlying psychological construct in our test
development is obviously expressed in the well-established
truth, "Clothes make the man."

This is the way to apply the theory: ask each
member of the conference committee to answer about 50
or 60 simple check-list items describing his wardrobe
of the day. For example, is his suit blue, grey or brown?
Single breasted or double breasted? Shoes - black,
brown, two-tone? Necktie - four-in-hand or bow,
quiet or loud? Shirt - white, colored; plain or
button-down collar; French or plain cuffs or half sleeves? And
how about the socks? Wool, silk, cotton, orlon, dacron,
nylon, plain or fancy? Get the facts by such a check
list and then let the statistician analyze the data. He
will find which items distinguish the principals from
the superintendents. He counts the frequency with
which each answer characterizes each group. If we
but try enough items some of them will surely turn
out to look significant in distinguishing between the
men with different jobs.

It turned out in our data that the questions about
suits, shoes and neckties taken as single items did not
exhibit any useful significance. About the same
number of superintendents and principals chose brown
suits and shoes and bow ties and white shirts. But the
combination two-toned shoes, bow tie and pastel shirt
with button-down collar was significantly a
superintendent's choice. Not a single principal chose the
combination. By further careful analysis of the
frequency counts for combinations of items we chose
the wardrobe check-list items of greatest differential
significance. We derived from the data sufficient
experimental insight to construct a
Superintendent-Principal key. The scores distinguished with 90%
accuracy whether a man in this group was a
superintendent or principal. The question is: Is this really a
valid measuring device?

Though we have not tried to duplicate the
experiment we are sure on logical grounds that we could
find discriminating items in a new group of principals
and superintendents. But the set of questions that
would look "significant" on the second study would
include different items.

Duplication of the experimental process of item
selection on independent samples of people would not
be cross-validation. True cross-validation would be a
trial of the selected items on new groups. If we apply
the principle of independence to the subsequent trials
the cross-validation data will tell us how well the test
really works. We surely expect that a trial of the
Wardrobe test on independent groups of principals
and superintendents would reveal faithfully the
non-significance of the items of the check list. We will
have caught up with the gremlins of chance.
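
The Wardrobe experiment lends itself to a small simulation. This is a sketch under loud assumptions, not a record of the original analysis: the ten principals, ten superintendents and 60 items come from the account above, but the answers here are pure coin flips, each item is a single yes/no question (the account above analyzed combinations of items), and the number of items kept, the midpoint threshold, the 200 repetitions and all function names are hypothetical choices.

```python
import random

N_ITEMS, N_PER_GROUP, N_KEPT = 60, 10, 8

def random_group(rng):
    """Ten people answering 60 yes/no wardrobe items purely at random."""
    return [[rng.random() < 0.5 for _ in range(N_ITEMS)] for _ in range(N_PER_GROUP)]

def build_key(supers, principals):
    """Keep the items with the largest frequency difference, keyed toward the superintendents."""
    diffs = []
    for item in range(N_ITEMS):
        d = sum(p[item] for p in supers) - sum(p[item] for p in principals)
        diffs.append((abs(d), item, 1 if d >= 0 else -1))
    diffs.sort(reverse=True)
    return [(item, sign) for _, item, sign in diffs[:N_KEPT]]

def score(person, key):
    return sum(sign * person[item] for item, sign in key)

def accuracy(supers, principals, key, threshold):
    hits = sum(score(p, key) >= threshold for p in supers)
    hits += sum(score(p, key) < threshold for p in principals)
    return hits / (len(supers) + len(principals))

def one_study(rng):
    supers, principals = random_group(rng), random_group(rng)
    key = build_key(supers, principals)
    # Threshold halfway between the two development-group mean scores.
    mean_s = sum(score(p, key) for p in supers) / N_PER_GROUP
    mean_p = sum(score(p, key) for p in principals) / N_PER_GROUP
    threshold = (mean_s + mean_p) / 2
    dev_acc = accuracy(supers, principals, key, threshold)
    # Cross-validation: same key, same threshold, entirely new people.
    new_acc = accuracy(random_group(rng), random_group(rng), key, threshold)
    return dev_acc, new_acc

rng = random.Random(2)
results = [one_study(rng) for _ in range(200)]
mean_first = sum(r[0] for r in results) / len(results)
mean_second = sum(r[1] for r in results) / len(results)
print(f"development groups: {mean_first:.2f}   new groups: {mean_second:.2f}")
```

Since the answers carry no information at all about a man's job, any accuracy above one half in the development groups is manufactured by the item selection. The new groups, scored with the unchanged key, bring the figure back to chance.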

We have emphasized the test maker's primary
responsibility for cross-validating the selection and
weighting of test items to produce good psychometric
instruments. The same fundamental ideas also apply
to the test consumer's responsibility for
cross-validating his use of tests in specific practical applications.
Whenever a variety of measuring devices is tried
experimentally to guide the choice of the most valid
battery some one of the tests will correlate best with
the criterion. By appropriate statistical methods a
combination of tests may be weighted to yield the
greatest multiple correlation. It is certain that
fortuitous accidents of sampling have influenced the choice
of tests and the weighting of the scores. The chance
effects may be very important, so important that
repetition of the experiment might result in a different
choice of tests or a very different regression equation.
A true estimate of the value of a weighting system,
regression equation, or a choice of cutoff scores on a
single test should be derived by cross-validation in the
test user's actual application.

Charles R. Langmuir

Note: Some of the interesting ramifications of the problem
are discussed in a symposium, "The Need and Means of
Cross-Validation," by C. I. Mosier, E. E. Cureton, R. A. Katzell,
and R. J. Wherry in Educational and Psychological
Measurement, Vol. 11, No. 1, Spring 1951, and a classic experimental
example is reported in "Validity, Reliability, and Baloney," by
E. E. Cureton in the same journal, Vol. 10, No. 1, Spring 1950.